feature attn with state #14299

Closed · Zijie-Tian wants to merge 90 commits into ggml-org:master from Zijie-Tian:tzj/feature-attn-with-state
Conversation
- Introduced `run-prefill-decode-bench.sh` for running prefill-decode benchmarks with customizable parameters, added `extract_bench_results.py` to extract structured data from benchmark markdown files into CSV, and updated `.gitignore` to exclude the generated `bench_results` directory.
- Introduced `analyze_benchmark_results.py` for processing benchmark CSV files and generating performance pivot tables, updated `run-prefill-decode-bench.sh` to support multiple KV cache types plus options for prompt length and forced alignment, and broadened the markdown file matching in `extract_bench_results.py`.
- Introduced `run_op_bench.sh` to run Flash Attention benchmarks with configurable head sizes, KV lengths, and quantization types, added `summary_flash_attn.py` to extract performance metrics and generate analysis summaries, and extended the test cases in `test-backend-ops.cpp` with additional KV lengths and quantization types for broader performance coverage.
- Introduced a profiling feature for the ggml library that tracks per-operation timings within computation graphs: added `ggml-profile.h` and `ggml-profile.cpp` defining the profiling structures and functions, added a `CMakeLists.txt` option to enable the graph profiler, integrated profiling calls into graph computation, and added profiling presets to `CMakePresets.json` (a minimal timing sketch follows below).
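The per-node timing idea can be illustrated with a minimal sketch. This uses only public ggml calls (`ggml_time_us`, `ggml_op_name`); the hook shape and the `node_timing` struct are assumptions, not the actual `ggml-profile.h` API from this PR:

```cpp
// Minimal sketch of per-node graph timing, assuming a hypothetical hook
// invoked around each node's computation; the real ggml-profile.h API
// introduced by this PR may be structured differently.
#include "ggml.h"

struct node_timing {
    const char * op_name; // name of the node's operation
    int64_t      t_us;    // accumulated wall-clock time in microseconds
};

// Illustrative wrapper around the computation of a single graph node.
static void profile_node(struct ggml_tensor * node, struct node_timing * acc) {
    const int64_t t0 = ggml_time_us();
    // ... the backend computes `node` here ...
    acc->op_name = ggml_op_name(node->op);
    acc->t_us   += ggml_time_us() - t0;
}
```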
- Added a function to enable or disable GGML graph profiling for a given output path, updated `test_gen` to enable profiling only during the last generation iteration, and ensured profiling is reset after each benchmark run in `main`.
- Switched the output format in `ggml-profile.cpp` to CSV for easier parsing, introduced a global variable in `llama-bench.cpp` to manage the `GGML_GRAPH_PROFILE` setting dynamically, and added a getter for the current value of the `GGML_GRAPH_PROFILE` environment variable (see the sketch below).
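Such a getter is plain standard C++; the helper name below is an assumption, not necessarily what `llama-bench.cpp` uses:

```cpp
// Hedged sketch: read the GGML_GRAPH_PROFILE environment variable.
// An empty result means profiling is disabled.
#include <cstdlib>
#include <string>

static std::string get_ggml_graph_profile() {
    const char * v = std::getenv("GGML_GRAPH_PROFILE");
    return v ? std::string(v) : std::string();
}
```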
- Introduced `run-breakdown.sh` for operator-breakdown profiling with configurable model path, thread count, output directory, and prefill depths, updated `.gitignore` to exclude the breakdown result files, and extended `llama-bench.cpp` to profile both prefill and decode operations.
- Introduced `analyze_breakdown.py` to parse the breakdown CSV files, analyze operator performance, and generate bar- and pie-chart visualizations, with a command-line interface that processes multiple CSV files or a single file and can produce comparison charts across depths.
- Introduced a `SKIP_ANALYSIS` flag that lets users skip the data-analysis step during profiling, documented the flag and its default in the help text, added a check that warns when Python dependencies are missing, and adjusted the output display accordingly.
- Added T-MAC quantization types and configurations to the ggml library: extended `convert_hf_to_gguf.py` with T-MAC options and quantization configurations, updated the CMake files with T-MAC compilation options and source files, introduced T-MAC utility functions in the gguf Python module, adapted the existing quantization logic to the new formats, and updated model loading and tensor operations to use the T-MAC optimizations.
- Added T-MAC quantization types and validation in `ggml.h` and `ggml-quants.c`, updated type traits and tensor-size calculations in `ggml.c`, made the CMake configuration include the T-MAC source files conditionally on compilation flags, and adapted the llama model loader and quantization logic to handle the new types across components.
- Adjusted the T-MAC type count in `ggml.h` to reflect the correct number of types under each compilation flag, and cleaned up `CMakeLists.txt` so the T-MAC definitions and include directories are included properly (type metadata can be inspected as sketched below).
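Any type registered in ggml's type-traits table, including new ones like the T-MAC types this PR adds, can be inspected through the public accessor. The helper below is an illustrative sketch, not code from the PR; the T-MAC type names and block layouts are not shown here:

```cpp
// Hedged sketch: report the block geometry of a registered ggml type via
// the public type-traits accessor in ggml.h. A T-MAC type added by this
// PR would report its own block size and bytes-per-block here.
#include "ggml.h"
#include <cstdio>

static void print_type_info(enum ggml_type type) {
    const ggml_type_traits * tt = ggml_get_type_traits(type);
    printf("%s: %lld elements/block, %zu bytes/block, quantized: %d\n",
           tt->type_name, (long long) tt->blck_size, tt->type_size,
           (int) tt->is_quantized);
}
```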
- Introduced `test-quantize-accuracy.cpp` to evaluate the accuracy of the quantization and dequantization round trip, and registered the new test in `CMakeLists.txt` (a minimal version of such a check is sketched below).
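A minimal round-trip accuracy check in the spirit of such a test might look like the following; this is a hedged sketch using only public ggml calls (`ggml_quantize_chunk`, `ggml_row_size`, `ggml_get_type_traits`), and the actual `test-quantize-accuracy.cpp` in this PR may be structured differently:

```cpp
// Hedged sketch of a quantize/dequantize round-trip accuracy check.
#include "ggml.h"
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int64_t n = 4096; // must be a multiple of the block size
    std::vector<float> src(n), out(n);
    for (int64_t i = 0; i < n; ++i) {
        src[i] = std::cos((float) i); // deterministic synthetic input
    }

    const enum ggml_type type = GGML_TYPE_Q4_0;
    std::vector<uint8_t> q(ggml_row_size(type, n)); // quantized buffer

    // quantize one row of n elements, then dequantize it back to floats
    ggml_quantize_chunk(type, src.data(), q.data(), 0, 1, n, nullptr);
    ggml_get_type_traits(type)->to_float(q.data(), out.data(), n);

    double se = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        se += (out[i] - src[i]) * (out[i] - src[i]);
    }
    printf("%s RMSE: %f\n", ggml_type_name(type), std::sqrt(se / n));
    return 0;
}
```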
- Introduced a CMake option to enable QlutAttn, updated `CMakeLists.txt` to conditionally compile the QlutAttn definitions and include directories, and wired the functionality into the `ggml-base` target.
- Added the additional T-MAC quantization types to `kv_cache_types` in `arg.cpp`, updated `ggml.h` to report the T-MAC type count without conditional compilation, and extended `llama-graph.cpp` so the attention mechanism supports the new T-MAC types.
- Introduced a new example, `flash-attn-inspector`, demonstrating flash attention in LLaMA models: added the CMake configuration to build it, implemented the main functionality in `flash-attn-inspector.cpp` with tensor-data handling and debug logging, and added a test target exercising callback functionality during inference (an eval-callback sketch follows below).
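Inspecting tensors during inference can be done with the scheduler eval callback that llama.cpp already exposes (`cb_eval` in `llama_context_params`). The sketch below shows the general pattern; it is not the actual `flash-attn-inspector.cpp` code:

```cpp
// Hedged sketch of a graph-eval callback that logs flash-attention nodes;
// the real flash-attn-inspector example may differ.
#include "llama.h"
#include "ggml.h"
#include <cstdio>

static bool inspect_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;
    if (ask) {
        // request data only for flash-attention nodes
        return t->op == GGML_OP_FLASH_ATTN_EXT;
    }
    fprintf(stderr, "%s: %s shape [%lld, %lld, %lld, %lld]\n",
            ggml_op_name(t->op), t->name,
            (long long) t->ne[0], (long long) t->ne[1],
            (long long) t->ne[2], (long long) t->ne[3]);
    return true; // continue graph execution
}

// Attach before creating the context:
//   llama_context_params params = llama_context_default_params();
//   params.cb_eval           = inspect_cb;
//   params.cb_eval_user_data = nullptr;
```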
- Added the `breakdown_results` and `breakdown_results_llamacpp` directories to `.gitignore` so files generated by breakdown profiling are excluded from version control.
- Updated `.gitignore` to include the `breakdown_results_llamacpp/` directory, added `ggml_structure.mdc` and `project_structure.mdc` documenting the project and its components, introduced `python_scripts.mdc` outlining the project's Python scripts, added new tests `test-flash-attn.cpp` and `test-mul-mat.cpp` validating flash attention and matrix multiplication, and registered the new test targets in `CMakeLists.txt`.
- ggml-ci
- ggml-ci
- Introduced `ggml_cpu_structure.mdc` detailing the CPU-specific implementation of the GGML tensor library, including core source files, operation implementations, and architecture-specific optimizations, and updated `ggml_structure.mdc` to reference the new CPU backend documentation.
- …ith quantized tensor support and improve computation graph handling
- …flash_attn_ext_mixed-function: Fix mask indexing in mixed flash attention and correct Q initialization
- …efill-test-and-align_kv-mixed.sh: Fix causal mask padding in flash decoding test
- …ng and adding layer-wise K/V quantization operations; improved logging for debugging and computation-graph handling
- …0-quantization-007b: Modify custom op for Q4_0 quantization
- …ard-flash-attn-ext-f16-function-7321: Modify ggml_compute_forward_flash_attn_ext_f16 function
- Implemented a function that converts ggml tensors to torch tensors using type traits, covering a range of tensor types; enhanced the dequantization path to use the type traits' float conversion and added error handling for unsupported types, improving integration with PyTorch and easing tensor management (the dequantize-to-float step is sketched below).
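The dequantize-to-float step underlying such a conversion can be sketched as follows. This is a hedged illustration using the public type-traits lookup from `ggml.h`; the PR's actual conversion function and its torch binding are not reproduced here, and the helper name is hypothetical:

```cpp
// Hedged sketch: dequantize/convert a contiguous ggml tensor to floats via
// the type-traits to_float routine, as a precursor to building a torch
// tensor from the resulting buffer.
#include "ggml.h"
#include <cstring>
#include <stdexcept>
#include <vector>

static std::vector<float> tensor_to_float(const struct ggml_tensor * t) {
    const int64_t n = ggml_nelements(t);
    std::vector<float> out(n);

    if (t->type == GGML_TYPE_F32) {
        std::memcpy(out.data(), t->data, n * sizeof(float)); // already float
        return out;
    }
    const ggml_type_traits * tt = ggml_get_type_traits(t->type);
    if (tt->to_float == nullptr) {
        throw std::runtime_error("unsupported tensor type"); // no float conversion registered
    }
    // assumes a contiguous tensor; converts all elements at once
    tt->to_float(t->data, out.data(), n);
    return out;
}
```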
- …sh-attn-ext-f16-febc: Fixed ggml_compute_forward_flash_attn_ext_f16_with_state